1. INTRODUCTION


The overall goal of any computer architecture created in silicon is to take an idea to a final
design in an efficient and practical way. To accomplish this goal, many designs are created
through complex and elaborate software programs that write netlists or structural descriptions of
silicon structures, either by means of a concise software system or by producing a design flow.
When engineers first started creating these silicon structures, the number of transistors and the
level of integration within a design were small, and designs were simple and straightforward.
However, as the complexity of computer architectures increased over time, many of these
software tools and design flows became demanding, elaborate, and extremely complex. Therefore,
engineers resorted to creating and/or modifying software tools to help efficiently control the
complexity of these implementations, with the ultimate goal of producing Very Large Scale
Integration (VLSI) computer architectures.

Although there are many open-source software tools and design flows to help create
high-performance computer architecture designs, they seldom produce results that are on par with
commercial VLSI software tools. This occurs because many VLSI software tools are produced
by Electronic Design Automation (EDA) companies that have large budgets and hire many
programmers to tackle ongoing research problems within computer architecture
implementations. Although many publicly available components, standard cells, and high-level
System on a Chip (SoC) descriptions are available for these VLSI tools, they are difficult to use
because of their complexity. This research aims to bridge the gap between complex SoC
descriptions of computer architectures and design flows that target commercial EDA design
tools.

Recently, a National Science Foundation (NSF) panel highlighted accurate modeling and
evaluation of complex architectures as an urgent and significant challenge for system
architects and digital system designers [1]. Ultimately, the NSF panel argues that simulation
and benchmarking will require a significant leap in capability within the next few years to
maintain ongoing innovations in computer systems and electronics. Even several years after
this seminal report illuminated what was needed in computer architectures, there are still
huge gaps in what can be designed and budgeted for this task [2]. Consequently, there is a need
for an efficient and reliable system that can be utilized for producing state-of-the-art computer
architectures, especially for silicon implementations.

The research objective of this work is to design, develop, and evaluate multi-core hardware
support for computer architectures at the nanometer level. Many of these architectures are
currently or will be employed in advanced architectures that may have secure capabilities within
the Air Force Research Laboratory (AFRL) in Rome, NY. This will be accomplished by
designing a complete design flow integrated with commercial and open-source EDA tools. The
design flow will take as inputs a high-level, system-level architecture description, along with
area, critical path delay, and power dissipation constraints. Based on the SoC architecture
description and design constraints, the tools will automatically generate synthesizable HDL
models, embedded memories, and custom components to implement the specified VLSI
architecture. It is anticipated that the results of this work will be a step closer to the guidelines
outlined by the aforementioned NSF panel [1].

A key component of the design infrastructure is that the tools will also generate simulation,
synthesis, and place-and-route scripts and interfaces for the VLSI architecture, which can be
used in conjunction with industry-standard design tools from Cadence Design Systems,
Synopsys, and Mentor Graphics Corporation to obtain area, delay, and power estimates.
Feedback from the design tools can then be used to modify the architecture description or design
constraints, if necessary. An important part of this work is to evaluate methodologies for
achieving low-power designs and for ensuring that the design tools are secure and do not add
malicious circuits.

2. BACKGROUND


Computer architectures are complicated by their fabrication, which requires complicated and
elaborate processes to produce these silicon structures. From the 1970s through the 1990s, most
computer architectures were created through many layers deposited on a silicon
substrate [1]. These depositions were usually fabricated in massive clean rooms that cost
millions of dollars to build and maintain. However, as engineers progressed into the 21st century,
computer architectures could also be implemented on Field Programmable Gate
Arrays (FPGAs). Although FPGAs are easy to design with and are far cheaper than most traditional
computer architectures created in custom silicon, they consume significantly more area, delay, and
power [3]. Consequently, for computer architectures that demand high
performance, custom silicon implementations are usually chosen.

The demand for increased speed, decreased energy consumption, improved memory utilization,
and better compilers for processors has become paramount to the design of the next generation of
computer architectures. To make matters worse, the traditional challenges of designing digital
devices with semiconductor technology have drastically changed with the introduction of deep
submicron technology. Designs that have been extending Moore's Law have discovered that
silicon technology has severe limitations at feature sizes below 180 nm [3]. Where a design could
once be improved simply by scaling the minimum feature size of a transistor, simple scaling no
longer suffices.

Because silicon technologies are so small, designs can now implement billions of transistors on a
reasonably small die. Unfortunately, this leads to power density and total power dissipation that
are at the limits of what packaging, cooling, and other infrastructure can support [4]. More
importantly, Complementary Metal Oxide Semiconductor (CMOS) technologies below 90 nm
have leakage power that almost matches or surpasses dynamic power, making power
dissipation a major obstacle to designing complex SoC designs [3].

Although power dissipation complicates the process by which integrated circuits are
produced, it does not necessarily mean that designs cannot be designed efficiently. A designer
just has to be cognizant that higher performance can no longer be obtained simply by increasing
the clock rate as technologies grow smaller. This new challenge requires designers to realize that
power and speed are closely linked and that engineering trade-offs are normally required if a
design requires both low power and high clock rates [5].

To make things worse, processor designers have increased core counts to exploit Moore's Law
[6]. This has made decisions about using multiple cores, and the performance that they entail,
sometimes difficult to navigate. More importantly, single-core processor designs are the engine
that ultimately makes multiple-core devices work. That is, for virtually all applications,
including single-core general-purpose computer architectures, reducing the power consumed by
SoCs is essential to allow new features that improve multiple-core technology. Consequently, it
is important to understand how power consumption affects SoC designs in order to improve
upon it.

2.1 Power Dissipation

The total power consumption of a digital logic circuit consists of two major portions [6, 7]. The
first part is dynamic power, which is the power consumed when a device is active.
Typically, dynamic power is consumed when devices are active and are switching back and
forth; that is, it depends on what is supplied at the input of a circuit. For example, if a circuit
has lots of activity (e.g., within a router for the Internet), it will typically consume lots of dynamic
power. Conversely, applications that only switch on during critical events (e.g., sensors within
automobiles for abnormal events) typically consume low amounts of dynamic power.

Dynamic power is mainly a function of the amount of switching that occurs during an event [8].
Since most CMOS circuits contain layers of silicon dioxide, which forms capacitances that
store charge well, a majority of the switching power stems from charging and
discharging these capacitances as transistors turn on and off, respectively. This typically results
in a squared dependence on the voltage:

$$P_{dyn} = C_L \cdot V_{DD}^{2} \cdot P_{trans} \cdot f_{clock} \qquad (1)$$

where C_L is the load capacitance, V_DD is the supply voltage, f_clock is the frequency of the
system clock, and P_trans is the probability of an output transition.
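As a rough illustration of the squared voltage dependence, the sketch below evaluates the switching-power expression P_dyn = C_L · V_DD^2 · P_trans · f_clock at two supply voltages. The capacitance, activity factor, and clock frequency are hypothetical values chosen only for the example and are not taken from this work.

```python
def dynamic_power(c_load, v_dd, p_trans, f_clock):
    """Switching power in watts: C_L * V_DD^2 * P_trans * f_clock."""
    return c_load * v_dd ** 2 * p_trans * f_clock


# Hypothetical operating point (assumed, not from the report):
# 1 nF of total switched capacitance, 10% transition probability, 500 MHz clock.
base = dynamic_power(c_load=1e-9, v_dd=1.2, p_trans=0.1, f_clock=500e6)
scaled = dynamic_power(c_load=1e-9, v_dd=1.0, p_trans=0.1, f_clock=500e6)

print(f"P_dyn at 1.2 V: {base:.4f} W")
print(f"P_dyn at 1.0 V: {scaled:.4f} W")
print(f"ratio: {scaled / base:.3f}")  # equals (1.0 / 1.2) ** 2
```

Under these assumptions, dropping the supply from 1.2 V to 1.0 V cuts switching power to about 69% of its original value while every other term is unchanged.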

In addition to switching power, internal power also contributes to dynamic power. Internal
power is dissipated while a CMOS gate switches between states. During the transition, both
the NMOS and PMOS transistors are momentarily ON, resulting in a short-circuit
or "crowbar" current. Although the short-circuit current can be small, it can contribute
noticeably to the total dynamic power if the input transition is too slow [9]. Short-circuit power
can be described as:


$$P_{sc} = t_{sc} \cdot V_{DD} \cdot I_{peak} \cdot f_{clock} \qquad (2)$$

where t_sc is the time duration of the short-circuit current and I_peak is the total internal switching
current. Although short-circuit current will not be discussed further in this report, it is important
to make sure gate outputs are not left floating when power-gating a circuit for
lower power consumption.

On the other hand, static power dissipation is defined as the power consumed when devices are
powered up and no signals are changing values [8]. In the past, static power dissipation, which is
mainly dominated by leakage current in a gate, was either negligible or did not significantly
impact a design. However, as the supply voltage and minimum feature size of a transistor get
smaller, the pronounced effect of leakage within a gate makes static power dissipation almost
equal to or greater than dynamic power below 90 nm [10].

In the past, traditional designs have resorted to lowering the supply voltage to get a quadratic
decrease in the power. This decision is substantiated by Equation (1), in which power dissipation
depends on the square of the supply voltage. The real problem is that lowering the
supply voltage causes the drain-to-source current of a transistor to decrease, which slows the
circuit. The drain-to-source current can be approximated by:

$$I_{DS} = \mu C_{ox} \frac{W}{L} \cdot \frac{(V_{GS} - V_T)^{2}}{2} \qquad (3)$$


where μ is the carrier mobility, C_ox is the gate oxide capacitance per unit area, W and L are the
width and length of the transistor, V_T is the threshold voltage, and V_GS is the gate-source
voltage. Since deep submicron technologies have low supply voltages, having a low threshold
voltage allows CMOS designs to maintain good performance [8]. Unfortunately, as the threshold
voltage gets smaller, an exponential increase in the sub-threshold leakage current (I_SUB) occurs.
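The trade-off between supply voltage, threshold voltage, and drive current can be sketched numerically. The example below assumes the long-channel square-law form I_DS = (μ C_ox / 2)(W/L)(V_GS − V_T)^2 and uses illustrative device parameters (a μ·C_ox product of 200 µA/V² and W/L = 10) that are assumptions for the example, not values from the technologies used in this work.

```python
def drain_current(mu_cox, w_over_l, v_gs, v_t):
    """Square-law saturation current in amperes; zero below threshold in this simple model."""
    overdrive = v_gs - v_t
    if overdrive <= 0:
        return 0.0
    return 0.5 * mu_cox * w_over_l * overdrive ** 2


MU_COX = 200e-6  # A/V^2, assumed illustrative value
W_OVER_L = 10    # assumed illustrative aspect ratio

nominal = drain_current(MU_COX, W_OVER_L, v_gs=1.2, v_t=0.4)
low_vdd = drain_current(MU_COX, W_OVER_L, v_gs=1.0, v_t=0.4)
low_vdd_low_vt = drain_current(MU_COX, W_OVER_L, v_gs=1.0, v_t=0.3)

print(f"1.2 V supply, 0.4 V threshold: {nominal * 1e6:.0f} uA")
print(f"1.0 V supply, 0.4 V threshold: {low_vdd * 1e6:.0f} uA")
print(f"1.0 V supply, 0.3 V threshold: {low_vdd_low_vt * 1e6:.0f} uA")
```

In this simplified model, lowering the supply alone removes a large share of the drive current, and lowering the threshold voltage recovers part of it, which is the motivation for low-threshold devices described above.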

The sub-threshold leakage current is the dominant element of static power dissipation
[6]. It occurs when a CMOS gate is not turned completely off. A good approximation of the
sub-threshold current is shown in Equation (4), where k is Boltzmann's constant, T is the
temperature in kelvin, q is the charge of an electron, and n is a function of the device fabrication
process. The sub-threshold leakage current for sub-90 nm transistors is the major source of
difficulty within current technologies, such as the IBM cmos10sf 65 nm technology used in this
work. In the past, static power from leakage was significantly lower than dynamic power;
however, with newer technologies and shrinking power supplies, static power dissipation is now
the dominant factor.


$$I_{SUB} = \mu C_{ox} \frac{W}{L} \left(\frac{kT}{q}\right)^{2} e^{\frac{q (V_{GS} - V_T)}{n k T}} \qquad (4)$$


Equation (4) indicates that sub-threshold leakage, which is the predominant factor in static power
dissipation, depends exponentially on the difference between V_GS and V_T. Therefore, as
technology scales the power supply and V_T down to limit the dynamic power, leakage power
grows exponentially, as was shown in [11]. To make matters worse, sub-threshold leakage
current increases exponentially with temperature, which also complicates low-power
design.
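A short worked example may help quantify this exponential dependence. It assumes the common sub-threshold form in which the current scales as exp(q(V_GS − V_T)/(nkT)), a fixed temperature of T = 300 K, and an illustrative slope factor n = 1.5; these numbers are assumptions for the example and do not come from the IBM technologies used in this work.

```latex
\begin{align*}
\frac{kT}{q} &= \frac{(1.381\times10^{-23}\,\mathrm{J/K})(300\,\mathrm{K})}
                     {1.602\times10^{-19}\,\mathrm{C}} \approx 25.9\ \mathrm{mV} \\[4pt]
\frac{I_{SUB}(V_T - \Delta V)}{I_{SUB}(V_T)} &= e^{\frac{q\,\Delta V}{n k T}} \\[4pt]
\Delta V = 100\ \mathrm{mV},\ n = 1.5: \qquad
e^{\,100/(1.5 \times 25.9)} &\approx e^{2.58} \approx 13
\end{align*}
```

Under these assumptions, lowering the threshold voltage by 100 mV raises the sub-threshold leakage by roughly a factor of 13, whereas the drive current in the square-law model grows only quadratically with the gate overdrive.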

Transistors are usually defined by their length and width; however, the former usually establishes
the minimum feature size of a transistor [6]. As technology moves towards smaller feature sizes,
the oxide below the gate of a transistor also decreases in thickness.
Unfortunately, in current semiconductor processes the oxide is only several
atoms thick. Consequently, the thinness of the oxide allows a current to tunnel through
the gate towards the channel of a transistor, so much so that in current processes gate leakage can
be nearly 1/3 as much as sub-threshold leakage [7]. In order to reduce the gate leakage, some
manufacturers have resorted to high-K dielectric materials, such as hafnium-based oxides, to
keep the gate leakage in check [11].

Another technique to reduce the leakage current is to use multi-threshold-voltage transistors.
Using this technique, high-V_T cells can be utilized wherever performance goals allow, to keep
the power dissipation in check. Specifically, having transistors with different threshold
voltages, usually associated with Multi-Threshold CMOS (MTCMOS) circuits, allows the
reduction of the sub-threshold current, as shown in Equation (4). Lower-V_T cells can then be
used on a critical path to meet a specific timing target, because cells with lower threshold
voltages switch faster and therefore have shorter propagation delays [9]. The technologies
utilized for this work are IBM's cmos10lpe 65 nm [12] and cmos32soi 32 nm [13]. Both
technologies enable the use of regular-V_T, high-V_T, and low-V_T standard cells to reduce
leakage and improve speed on the critical path.
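The idea of using high-V_T cells wherever timing allows can be sketched as a simple per-cell decision. The example below is a deliberately simplified illustration: it treats each cell's slack independently, whereas real tools account for slack shared along a path. The `Cell` fields, cell names, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    slack_ps: float         # timing slack available at this cell (hypothetical)
    penalty_ps: float       # extra delay if the cell is swapped to high V_T
    leak_low_vt_nw: float   # leakage of the low-V_T variant, in nW
    leak_high_vt_nw: float  # leakage of the high-V_T variant, in nW


def assign_vt(cells):
    """Pick the high-V_T variant only where the slack covers the delay penalty."""
    return {c.name: ("high" if c.slack_ps >= c.penalty_ps else "low") for c in cells}


def total_leakage_nw(cells, choice):
    """Sum the leakage of the chosen variant of every cell."""
    return sum(
        c.leak_high_vt_nw if choice[c.name] == "high" else c.leak_low_vt_nw
        for c in cells
    )


cells = [
    Cell("u1", slack_ps=0.0, penalty_ps=15.0, leak_low_vt_nw=40.0, leak_high_vt_nw=4.0),
    Cell("u2", slack_ps=60.0, penalty_ps=15.0, leak_low_vt_nw=40.0, leak_high_vt_nw=4.0),
    Cell("u3", slack_ps=10.0, penalty_ps=15.0, leak_low_vt_nw=25.0, leak_high_vt_nw=3.0),
    Cell("u4", slack_ps=120.0, penalty_ps=20.0, leak_low_vt_nw=80.0, leak_high_vt_nw=9.0),
]

choice = assign_vt(cells)
all_low = total_leakage_nw(cells, {c.name: "low" for c in cells})
mixed = total_leakage_nw(cells, choice)
print(choice)
print(f"all low-V_T: {all_low} nW, mixed: {mixed} nW")
```

In this toy data set the two cells with no spare slack stay low-V_T, and swapping the other two reduces the summed leakage from 185 nW to 78 nW.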

2.2 System on Chip Design Flow

Many design flows involve taking structural or behavioral descriptions of computer architectures
and translating them into working silicon mask layers that can be fabricated. Although this
process is an evolution of what software compilers do, it has dramatically changed
from early designs involving several hundred transistors to current System on Chip designs that
approach or exceed 1 billion transistors [14]. To make matters worse, power and
high-performance issues have complicated the entire process [5].

Standard-cell designs involve taking pre-made layout elements, such as an AND or NAND gate,
and having software stitch the elements together by placing each layout and routing wires between
known pins. Early layout editors, such as the Magic Layout Editor, had built-in routers so that
designers did not have to worry about laying out wire between two points [15]. However, as
the number of points to be connected grew and the cost of a given route increased, there was a
dramatic need for more algorithms to deal with congestion and efficiency [16].
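To make the routing problem concrete, the sketch below connects two pins on a small grid with a breadth-first (Lee-style) maze search. This is a textbook technique shown only for illustration; it is not the router built into Magic or any commercial tool, and the grid and pin locations are made up.

```python
from collections import deque


def route(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    grid[r][c] == 1 marks a blocked cell (for example, an existing wire or cell).
    """
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back through predecessors
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in prev:
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # no route exists


grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = route(grid, (0, 0), (2, 0))
print(path)

blocked = [
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
print(route(blocked, (0, 0), (2, 0)))
```

The search finds the detour through the single gap in the middle row, and it reports failure when the obstacles fully separate the two pins, which is the situation congestion-aware algorithms try to avoid.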

Software has been written to translate or parse high-level descriptions of digital systems into a
representation that can be optimized and mapped to a standard-cell library. More
importantly, many of the points that are lexed and subsequently parsed within a software tool
can be connected from one standard-cell part to another standard-cell part, custom-cell part, or pin.
Therefore, it is important that software can translate, optimize, and map a high-level description
into these netlists accurately and concisely. Typically, this process of translating, optimizing,
and mapping is called synthesis [6].

After synthesis, netlists can be utilized to place standard cells, custom cells, input/output pins
and drivers, memories, and other ancillary parts onto a grid. The placement is optimized for
wire length, power connections, and other cost elements associated with
each tool. Consequently, the process of going from idea to final mask layers for silicon
fabrication can be broken into two distinct phases: front-end processing and back-end processing
[17].

Another important concept is that many front-end and back-end tools rely on heuristics.
That is, the underlying problems tend to be NP-hard (i.e., no polynomial-time algorithm is known
that solves them exactly) [17]. Therefore, each time an algorithm runs it may produce a
different outcome that is nevertheless close to an optimal answer [6]. This was one of the main
reasons this research incorporates tools from professional EDA vendors: they can produce the best
outcome given a set of high-level netlists and constraints.
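The run-to-run variation of heuristic tools can be illustrated with a toy placement problem. The sketch below places six cells in a row and repeatedly tries random swaps, keeping a swap only when the total wire length does not get worse. It is a minimal stand-in for the far more elaborate heuristics in commercial tools; the netlist, the step count, and the seeds are all hypothetical.

```python
import random


def wire_length(order, nets):
    """Total 1-D wire length: sum of slot distances between connected cells."""
    slot = {cell: i for i, cell in enumerate(order)}
    return sum(abs(slot[a] - slot[b]) for a, b in nets)


def improve(order, nets, seed, steps=200):
    """Random-swap hill climbing; the seed fixes the sequence of attempted swaps."""
    rng = random.Random(seed)
    current = list(order)
    cost = wire_length(current, nets)
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = wire_length(current, nets)
        if new_cost <= cost:
            cost = new_cost  # keep the swap
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return current, cost


cells = ["a", "b", "c", "d", "e", "f"]
nets = [("a", "f"), ("b", "e"), ("c", "d"), ("a", "c"), ("e", "f")]
start_cost = wire_length(cells, nets)
print("initial", "".join(cells), start_cost)
for seed in (1, 2, 3):
    order, cost = improve(cells, nets, seed)
    print("seed", seed, "".join(order), cost)
```

Each seed explores a different sequence of swaps, so the final orderings (and possibly the final costs) can differ between runs even though none is worse than the starting placement.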

The front-end is usually associated with synthesis and any preliminary placement of parts that
have been mapped during the synthesis process. Some tools are able
to pre-place parts to aid the synthesis process (e.g., through topographical mode in
Synopsys Design Compiler); however, most flows start the front-end process by first
synthesizing and then placing parts initially onto a grid.

The grid is important in that it allows all the pins to connect together, and a well-defined grid
helps the front-end get its job done quickly and accurately [17]. It also simplifies the process of
figuring out wire length, which is crucial to many constraint-driven EDA tools [16]. A grid that
is too coarse or too fine may result in a design that does not meet a cost
criterion or, worse yet, a placement that cannot connect all pins for a given netlist. To help
designers, most technology kits that come from commercial fabrication sites choose the grid for
their users. In this paper, the technology kits come from IBM and are all drawn on a 5 nm grid.
The back-end involves the number crunching that occurs once an initial placement of parts is set
on a grid. During the back-end process the software tools typically move some of the placement
around and finally place and route the pins together. Each design has a constraint for a
given objective, whether it is power dissipation, energy consumption, or fast critical paths. The
back-end also produces timing and power/energy reports that help users accurately characterize a
given design.

3.0 METHODS, ASSUMPTIONS AND PROCEDURES


The goal of this paper is to research and develop techniques, tools, and flows for high-level
synthesis  of  SoC  platforms  in  deep  sub-micron  CMOS  technologies  that  (1)  provide  the  ability  to 
efficiently  integrate  embedded  memories,  processors,  hardware  accelerators,  and  communication 
structures,  (2)  utilize  synthesis  and  layout  information  to  accurately  estimate  area,  delay,  and 
power  from  high-level  SoC  architecture  descriptions,  (3)  facilitate  design-space  exploration  and 
component reuse in multiple-core SoC solutions, and (4) are well documented and easy to use. This
goal  will  be  accomplished  by  researching  and  developing  high-level  design  flows  for  complete 
SoC  solutions  and  using  computer  tools  to  explore  new  techniques  for  creating  fast  critical  paths 
and  exploiting  power  management. 

The  high-level  synthesis  tools  will  take  as  inputs  a  high-level  SoC  architecture  description,  a 
parameterized  library  of  configurable  SoC  components,  and  design  constraints.  The  tools  use 
these  inputs  to  generate  synthesizable  HDL  models,  embedded  memories,  and  custom 
components to implement the specified SoC architecture. The tools also generate simulation,
synthesis,  and  place-and-route  scripts  for  the  SoC  architecture,  which  are  used  in  conjunction 
with  industry-standard  design  tools.  A  major  element  of  the  work  produced  in  this  research  is 
that a variety of commercial tools can be used together or separately. To accomplish this,
specific interfaces are created that allow many of the tools to exchange information with one another.

In  addition  to  the  design  flows  and  tools,  this  research  produced  hardware  accelerators, 
functional  units,  processors,  memories,  and  communication  structures  for  use  in  low-power  SoC 
systems,  such  as  multimedia  PDAs  and  digital  cameras.  These  components  are  characterized  in 
terms  of  area,  delay,  and  power  dissipation  and  used  in  conjunction  with  a  flexible  simulation 
framework  to  facilitate  rapid  design  space  exploration  of  new  SoC  solutions  and  power 
management  techniques.  Power  efficiency  is  targeted  at  the  system,  architecture,  circuit,  and 
layout  levels  to  provide  a  firm  framework  for  the  design  and  evaluation  of  future  applications. 

The  following  elements  were  developed  for  this  project  at  the  Air  Force  Research  Laboratory: 

• Design flows and SoC components for integration into a complete System on Chip design
for multiple commercial EDA VLSI tools.

• Extensible test environments that allow for easy chip exploration and analysis.

• A multiple-core relaxed-consistency memory architecture for use within possible
AFRL secure processor design architectures.

Each of the subtasks is described in the following subsections. As summarized above,
the subtasks together focus on the development of high-level EDA tools for low-power
SoC designs.

Although some of the items have been previously implemented, one of the major elements of this
work is the integration of components for rapidly shrinking transistor sizes. As transistors get
smaller and smaller, the major element that impedes designs is wiring, or interconnect [6].
Consequently, as more and more elements are put together on a device, electrical effects such as
current drive become important. For example, a wire that only traversed a small distance in
previous designs may in fact have several microns to travel in larger System on Chip (SoC) designs.
Therefore, wires may not pass correct digital voltages without proper
examination. These effects are typically called Signal Integrity issues.

Additionally, one of the deliverables for this work was to develop
Very Large Scale Integration (VLSI) design flows and tools, and to research and develop highly
optimized standard-cell libraries, functional units, processors, memories, and communication
structures for use in low-power, small-feature-size SoC systems. The design-flow scripts are put
together so that Signal Integrity issues, as well as variations within semiconductor devices, are
accounted for within the scripts. In many cases, the scripts are documented with easy-to-learn
elements that help users understand how the tools combat problems such as signal degradation,
variation of timing effects, and other anomalies caused by small-transistor-size components.

3.1  System  on  Chip  Framework 

This  framework  consists  of  an  integrated  set  of  design  tools  and  flows  that  take  a  high-level  SoC 
architecture  description,  a  parameterized  library  of  configurable  SoC  components,  and  design 
constraints. A high-level architecture description can be written using a combination of the Verilog
and VHDL Hardware Description Languages (HDLs) and a parameterized description of the
components used to implement the system. For example, when specifying a configurable
processor the description might contain information on the datapath size, the number and type of
functional units, and the subset of available instructions supported. The tools then use these
inputs to generate synthesizable HDL models, embedded memories, and custom components to
implement a specified SoC architecture. The initial target of this work is the IBM cmos10lpe
65 nm technology; however, the tools were designed so that they could easily be retargeted to
other technologies. Specifically, the tools also target the IBM cmos32soi 32 nm technology.
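One way to picture a parameterized component description is as a small record that a generator turns into HDL parameters. The sketch below is purely illustrative: the `ProcessorConfig` class, its field names, and the emitted parameter names are hypothetical and do not reflect the actual description format used by the framework. The output is a Verilog `localparam` fragment meant to be included inside a module.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessorConfig:
    """Hypothetical description of one configurable processor."""
    name: str
    datapath_bits: int
    functional_units: dict = field(default_factory=dict)  # unit type -> count
    instructions: tuple = ()                              # supported subset

    def to_verilog_params(self):
        """Render the configuration as Verilog localparam declarations."""
        lines = [
            f"// Parameters for {self.name} (illustrative only)",
            f"localparam DATAPATH_BITS = {self.datapath_bits};",
        ]
        for unit, count in sorted(self.functional_units.items()):
            lines.append(f"localparam NUM_{unit.upper()} = {count};")
        for op in sorted(self.instructions):
            lines.append(f"localparam ENABLE_{op.upper()} = 1;")
        return "\n".join(lines)


cfg = ProcessorConfig(
    name="core0",
    datapath_bits=32,
    functional_units={"alu": 2, "mul": 1},
    instructions=("add", "sub", "mul"),
)
text = cfg.to_verilog_params()
print(text)
```

Keeping the description in one record means the same data could also drive generated synthesis scripts and constraint files, so the HDL and the tool setup would not drift apart.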

With the advancement of many technologies that are involved in Electronic Design Automation
(EDA) software, the process of creating high-performance digital systems becomes possible.
However, creating a repeatable engineering process requires a delicate and
balanced approach to engineering design and utilization of this EDA software. Several
companies have been integral in creating this process, including Cadence Design Systems®
(CDS).


Cadence Design Systems® was one of the first EDA companies to specialize in assembling
silicon systems from designs created with multiple methodologies, using their First Encounter
tool. Using this approach, modern designs can be comprised of custom silicon devices,
designs created with Hardware Description Languages (HDLs), and Intellectual Property
from third-party companies. To reflect the tool's ability to pull together
multiple digital systems, CDS now calls its tool Encounter Digital Implementation™ (EDI). That
is, EDI's overall purpose is to provide an EDA tool for the assembly of System on Chip (SoC)
designs. Although EDI works efficiently, its use is predicated on commands or instructions with
which users direct EDI to integrate each of these layers. Unfortunately, as time has progressed,
EDI has had difficulty maintaining commands that are unified and cogent enough for users to
accomplish their intended purpose.

To make EDI more understandable and easier to use, CDS created a sequence of commands, or
design flow, that is integrated as a template for use with their tool. These scripts are called the
Encounter Foundation Flow (EDF). The EDF allows users to easily get EDI working by having
users focus on what files they need instead of the commands they need to use. Although many
designs still require users to optimize many of their commands, the EDF gets users started and
allows them to focus on using the tool efficiently. An example design using 65 nm IBM
technology and the EDF is shown in Figure 1.


Figure 1. Sample Cadence Design Systems Encounter Tool Screenshot


To help the design flow integration and EDF work together, a set of Makefile scripts was created
to allow all the tools to run independently. More importantly, the Makefile scripts also manage
log files, report creation, and storage of intermediate files that allow designs to be preloaded or
stored for later use. Makefiles are important in computer-based tools in that they are text files
written in a prescribed syntax that lets software organize code, its compilation, and
the products of each compilation step. The Makefile structure uses the following targets for
each compilation step:

1. init: initialization of the chip and reading of required input files.

2. place: placement of the chip design after pre-placement. Pre-placement is done manually to
give the tool a head start on which parts are better placed together or next to
each other.

3. pre-cts: initial clock-tree synthesis and examination of the clock tree.

4. cts: clock-tree synthesis to optimize the clock tree.

5. post-cts: examination of the automatic clock-tree synthesis and whether it meets specific
chip requirements.

6. route: routing of the netlist using wires based on cycle-time optimization and/or
power-based considerations.

7. postroute: examination of cycle time to check min-time constraints (i.e., hold time).

8. final: final chip connections and report creation.

Each stage of the Makefile sequence corresponds to a stage or sequence in the front-end or
back-end process. The design flow sequence is shown in Figure 2. All elements within the Makefile
are designed to allow users to move between stages. That is, a user could run “init” and then
stop. The Makefile creates specific flow outputs that the EDA tools use in creating reports,
intermediate files, and any ancillary files helpful in the design flow.
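
The stage ordering described above can be sketched in a few lines of Python. This is only an illustration of the idea that a user can stop at any stage and later resume from stored intermediate files; the stage names come from the list above, while the function name `stages_to_run` and the `completed` argument are assumptions made for this sketch and are not part of the actual Makefiles.

```python
# Stage names follow the Makefile targets listed above; the helper itself is hypothetical.
STAGES = ["init", "place", "pre-cts", "cts", "post-cts", "route", "postroute", "final"]


def stages_to_run(target, completed=()):
    """Return the stages that must run, in order, to reach `target`.

    Stages listed in `completed` are assumed to have stored their
    intermediate files already, so they are skipped, much as make
    skips targets that are up to date.
    """
    if target not in STAGES:
        raise ValueError(f"unknown stage: {target}")
    last = STAGES.index(target)
    return [stage for stage in STAGES[: last + 1] if stage not in completed]


# Running only the first stage, then resuming later up to clock-tree synthesis.
print(stages_to_run("init"))
print(stages_to_run("cts", completed=("init", "place")))
```

In a real Makefile the same effect comes from each target depending on the output file of the previous stage.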


[Figure 2 labels: Flow Inputs; Flow Outputs; HDL netlists; gate-level netlists; wire delays;
switching activity; area, delay, and power reports]


Figure  2,  Design  Flow  Structure  for  System  on  Chip  Framework 


In addition to a Makefile, many of the scripts in the design flow are written in a scripting
language that most EDA vendors support called Tool Command Language (Tcl). John
Ousterhout invented Tcl while he was a faculty member at the University of California, Berkeley,
mainly for a public domain VLSI tool called Magic [15]. The Tcl language is useful in that it has
an easy-to-learn format and also works well with the Graphical User Interfaces (GUIs) of most
EDA tools. Moreover, Tcl is now widely supported by most EDA tools.

Initially, all scripts were created with CDS’ EDI tool in mind. This was done as a matter of
convenience, but the ultimate goal is to integrate the best EDA tools related to System on Chip
design. Currently, there are two popular commercial EDA tools that support SoC
design: Cadence Design Systems® (CDS) Encounter Digital Implementation™ (EDI) and
Synopsys® IC Compiler™ (ICC). Both tools are important leaders in the SoC field: EDI
tends to be used more for optimizing cycle times, whereas ICC tends to be used for low-power
options.

An important part of this research is allowing the EDA tools to work together. An
important impediment in the construction of the SoC framework is that EDI and ICC use
different file formats. EDI uses a format based on the Library Exchange Format (LEF).
Although CDS has moved more towards a newer format called OpenAccess (OA),
this research utilized formats based on the LEF file structure. This was done because LEF is a
format that is common in the VLSI community and CDS has indicated it will continue to support
LEF into the future. On the other hand, Synopsys ICC uses Milkyway formats.
Therefore, it was important to create scripts that allow the tools to work together.

To help with the creation of a digital design, a common setup Tcl file is utilized to allow users to
add information for the IBM 65nm and 52nm technologies. This Tcl file has information related
to core Synopsys file geometry and area/delay/power information. To allow a design to work
within Synopsys® ICC, a common file structure or database called Milkyway is utilized. Before
a design could be used with Synopsys ICC, the Milkyway database was created using the
command:

milkyway -galaxy -nogui -tcl -log memory.log one.tcl

As stated previously, it is important to create tools that work together. Custom Makefiles
reproduce the same structure that was used within the EDI tool system, including
the naming structure (e.g., init, place, route). The Makefiles are integrated into one structure
and allow either tool to run independently of the other.

EDA tools are defined to work with different file structures. Although common file formats are
available for some EDA tools (e.g., OpenAccess), most EDA systems tend to use file
formats that are unique to a given tool. This allows vendors to tie customers to their
tools. That is, if a company devotes 2-5 years to a SoC design and uses a proprietary
format from CDS, it is less likely to change EDA tools. Typically, Synopsys® tools use Milkyway
databases, whereas Cadence Design Systems® tools use Library Exchange Format (LEF) files. To
help create tools that allow common exchange of data, Tcl scripts were modified to output
Design Exchange Format (DEF) files.

The Design Exchange Format (DEF) is an open specification that represents the physical
information of a design, sometimes called a layout, in an American Standard Code for
Information Interchange (ASCII) format. As opposed to LEF files, which only have geometry
information on certain layers, DEF files contain more information on each underlying structure.
Unfortunately, DEF files usually contain only the netlist, the component placements, and the
routing information. The geometry of each individually placed element, or of the individual
items within each SoC, is still required from either LEF or Milkyway. Therefore, it was important
that each tool output DEF while still retaining the structure of its geometry components (i.e.,
LEF and Milkyway).
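
To make the division of labor concrete, the following Python sketch reads the placement records from a small DEF COMPONENTS section. It is a simplified illustration and not a script from this work: the sample instance and cell names are invented, and the regular expression only handles single-line PLACED or FIXED records. Note that the result gives each instance's cell name, location, and orientation, but not the cell's geometry, which still has to come from LEF or Milkyway.

```python
import re

# Hypothetical DEF fragment; instance and cell names are invented for illustration.
SAMPLE_DEF = """\
COMPONENTS 2 ;
- u_core0/alu NAND2X1 + PLACED ( 1000 2000 ) N ;
- u_core0/inv INVX1 + FIXED ( 3000 2000 ) FS ;
END COMPONENTS
"""

# Matches only simple one-line component records with a placement status.
COMPONENT_RE = re.compile(
    r"^-\s+(?P<inst>\S+)\s+(?P<cell>\S+)\s+\+\s+(?P<status>PLACED|FIXED)"
    r"\s+\(\s*(?P<x>-?\d+)\s+(?P<y>-?\d+)\s*\)\s+(?P<orient>\S+)\s*;",
    re.MULTILINE,
)


def parse_components(def_text):
    """Map instance name -> (cell name, x, y, orientation)."""
    placements = {}
    for match in COMPONENT_RE.finditer(def_text):
        placements[match["inst"]] = (
            match["cell"],
            int(match["x"]),
            int(match["y"]),
            match["orient"],
        )
    return placements


for inst, record in parse_components(SAMPLE_DEF).items():
    print(inst, record)
```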

In order to get the scripts to work together, the scripts were modified to write out DEF
information at each independent stage. DEF conveys the logic design data as well as the physical
information to either EDI or ICC. This physical information includes the physical placement
locations, orientations, and routing geometry data, as well as the logic design changes for back
annotation of the design to any netlist. Although DEF works well for converting files between
multiple EDA tools, it does not help with the setup information. Consequently, it is important
that the setup for both Synopsys and Cadence be utilized for any integration of the tools within
this work.


Figure 3, Four-Core Secure Processor Layout using System on Chip Framework EDA tools


Another important element is that the DEF file does not contain the file format that each tool
utilizes. Therefore, if a design is to be translated between two tools (e.g., EDI
to ICC), it must have an initial file format to work with. In practice, a design should have an
initial floorplan so that it can import the DEF correctly. Otherwise, the file formats
may not align properly. This can easily be achieved by initializing the place & route process
within EDI or ICC (e.g., the init phase) and then importing a layout from DEF.

During the course of this research, one particular design was taped out for fabrication using both
the Cadence Design Systems® and Synopsys® tools. The current design involves four cores that
are secure processors. Each processor has a set of memories, drivers, input/output pins,
First-In First-Out (FIFO) buffers, and standard cells. Figure 3 shows the final layout before
fabrication. The design was optimized for 300 MHz and a low power dissipation footprint.

3.2 System on Chip Test Chip Environment

This  work  also  integrated  a  low-power  standard  cell  library  flow  for  static  CMOS  logic, 
configurable  Random  Access  Memory  (RAM),  Read  Only  Memory  (ROM),  pad  and 
input/output  drivers,  and  cache  memory  generators  using  IBM  and  Virage  Logic  Design 
Intellectual  Property  designs.  The  SoC  components  are  integrated  into  the  high-level  synthesis 
framework  to  facilitate  the  design  of  complex  SoC  solutions.  In  developing  these  components, 
one  goal  is  to  utilize  the  design  framework  to  generate  different  functional  units  and  processors 
to  evaluate  potential  tradeoffs  in  key  design  decisions.  For  example,  adding  new  instructions  and 
functional  units  might  help  improve  the  performance  of  certain  3D  graphics  applications,  yet 
cause  an  unacceptable  increase  in  static  power  and  area.  Having  a  design  framework  that  can 
quickly  generate  and  evaluate  a  wide  variety  of  designs  will  assist  in  making  correct  decisions 
early in the design process.

A layout for a SoC design can normally be simulated including its underlying parasitics. This
information is invaluable to this project since decreasing feature sizes and increasing chip sizes
make the influence of wiring parasitics on the performance of the chip critical or even dominant.
As noted by the International Technology Roadmap for Semiconductors (ITRS), gate delays may
decrease, but wiring delays may increase as minimum feature sizes decrease [6]. In order to
guarantee accurate simulations, two-dimensional and three-dimensional extractions were
performed with commercial tools, such as Cadence Design Systems’ Fire&Ice and
Mentor Graphics’ Calibre, for the proposed SoC design tools and framework. Having detailed
extraction is paramount since existing design flows with this capability are minimal or
non-existent. Consequently, each design can then be fully characterized in terms of area, delay, and
power dissipation and utilized in conjunction with a flexible simulation framework to facilitate
rapid design space exploration of new SoC solutions and power management techniques. Figure
4 shows a sample design of the testbed.

The design flow is then integrated along with the parameterized cells provided by IBM. Cadence
Design Systems’ Virtuoso is used for layout creation and schematic entry; however, the design
flow is designed to work with both Synopsys and Cadence Design Systems synthesis engines. Cadence
Design Systems’ EDI and Synopsys’ ICC are utilized for place & route, and Mentor Graphics’
Calibre is used for Design Rule Checking, Electrical Rule Checking, Layout Versus Schematic,
and Parasitic Extraction (PEX). The PEX is extremely important for power dissipation
estimation, since the capacitance of the wires is required to adequately address the total static and
dynamic power dissipation within a design.
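
As a rough illustration of why extracted wire capacitance matters, the sketch below uses the standard first-order CMOS model, in which switching power is activity × capacitance × Vdd² × frequency and leakage adds current × Vdd. All numeric values are assumed for illustration (only the 300 MHz clock comes from the design described above), and the function names are hypothetical rather than part of any EDA tool.

```python
def dynamic_power(activity, cap_farads, vdd, freq_hz):
    """First-order switching power in watts: alpha * C * Vdd^2 * f."""
    return activity * cap_farads * vdd ** 2 * freq_hz


def total_power(activity, gate_cap, wire_cap, vdd, freq_hz, leakage_amps):
    """Switching power over gate plus wire capacitance, plus leakage power."""
    switching = dynamic_power(activity, gate_cap + wire_cap, vdd, freq_hz)
    return switching + leakage_amps * vdd


# Assumed values: 10% activity, 2 pF of gate load, 3 pF of extracted wire load,
# 1.0 V supply, 300 MHz clock, 1 uA of leakage.
without_wires = total_power(0.1, 2e-12, 0.0, 1.0, 300e6, 1e-6)
with_wires = total_power(0.1, 2e-12, 3e-12, 1.0, 300e6, 1e-6)
print(f"estimate without wire capacitance: {without_wires * 1e6:.1f} uW")
print(f"estimate with wire capacitance:    {with_wires * 1e6:.1f} uW")
```

With these assumed numbers the estimate that ignores the wires is well under half of the extracted one, which is the kind of gap a synthesis-only power figure can hide.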


Figure  4,  Sample  System  on  Chip  Testbed  for  Testing  and  Innovation 

3.3 Computer Architecture Simulation and Memory Coherence/Consistency Modeling Environment

Computer organization is typically limited by the speed of its underlying structure. For example,
a piece of silicon has certain delay, area, and power properties that are contingent on how it is
drawn and organized when manufactured. For most computer architectures this has been a
challenge, because most parts within a computer architecture have different organizations and
internal structures. Typically, a designer has to balance these circuit dependencies within the
technology against algorithmic optimizations for a given computer architecture. One such challenge
is the process of designing logic within a computer architecture to process data, such as with
carry-propagate adders, and its interaction with storing this data.

Most computer systems store this data within pieces of logic that allow data to be read and
written as needed. These “memories” are crucial to any computer system and in some ways
really define a computer system. In other words, a system without some storage
mechanism, such as memory, can hardly be called a computer system. Although memories have
revolutionized computer systems, they tend to have problems synchronizing with the computation
part of any computer system, typically called its datapath. This occurs because any memory
system usually relies on some form of feedback that consumes a given amount of time to resolve
[17]. Since memories are typically slower than most datapaths, they tend to be the defining
computer element for speed and for solving problems efficiently.

Unfortunately, the gap between memory performance and that of the datapath has grown quite
considerably over the last twenty years [18]. Although computer designers have found new
ways to create better memory architectures, memories are still outpaced by every datapath design
created today. To compensate for the growing “memory gap”, designers have hidden this
lagging speed behind increases in processor speed [6]. To further trade speed for
performance, memory designers have also targeted multiple memory hierarchies to help faster
memory accesses occur more frequently. That is, memories with faster speeds but smaller sizes
are placed closer to the datapath, whereas memories with slower speeds and larger sizes are
located farther away from the datapath. Digital logic aids the transition from level to
level within the memory hierarchy in the form of a “cache”. Much to the dismay of
most computer designers, the speed at which computers can operate is reaching a theoretical
limit imposed by the underlying silicon structure of the processor.
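
The benefit of a memory hierarchy is usually summarized with the textbook average memory access time relation, in which the cost at each level is its hit time plus its miss rate times the cost of going one level farther out. The sketch below applies that relation to assumed latencies and miss rates; the numbers and the function name `average_access_time` are illustrative and are not measurements from this work.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles for a multi-level hierarchy.

    `levels` lists (hit_time, local_miss_rate) pairs, ordered from the
    level closest to the datapath outward; `memory_latency` is the cost
    of reaching main memory after the last level misses.
    """
    penalty = memory_latency
    # Work from the outermost cache inward, folding each level's miss penalty in.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


# Assumed example: L1 hits in 1 cycle and misses 5% of the time, L2 hits in
# 10 cycles and misses 20% of the time, main memory takes 100 cycles.
hierarchy = [(1, 0.05), (10, 0.20)]
print(average_access_time(hierarchy, 100))
print(average_access_time([], 100))  # no caches: every access pays the memory latency
```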

To compensate for the limited speed at which computer systems can operate, computer
designers have branched out into increasing the performance of systems in parallel. In other
words, instead of worrying about a given Instructions per Cycle (IPC) performance, designers
have focused on Thread-Level Parallelism (TLP). A thread has many definitions, but in its
purest definition it can be labeled as a baseline operation for a given processor [17]. In
shifting from IPC to TLP, processors face difficult tasks in synchronizing memory
within multiple hierarchies. For example, if a computing system has multiple cores, it will
typically have multiple caches, and design logic within the computer architecture must make sure
these caches are synchronized across the cores.

Most computer designers separate the problem of synchronizing memory across memory
hierarchies into two classifications: coherence and consistency. Coherence defines
what values can be returned by a read, and consistency determines when a written value will be
returned by a read [19]. Coherence and consistency are complementary in that coherence
defines the behavior of reads and writes to the same memory location, whereas consistency
defines the behavior of reads and writes with respect to accesses to other memory locations [19].
This basically boils down to making sure reads and writes are not interrupted and do not happen
out of order with respect to each other. In
other words, memory is written and read such that all operations are atomic, or can be done in
such a way that no intervening operation can occur.

Traditionally, coherence has been handled by local digital logic that tracks the state of the cache
with which it is associated. The earliest textbooks identify these algorithms as adaptive Finite State
Machines (FSMs) that continuously “snoop” the bus and watch whether any addresses on the bus are
within their caches. If a corresponding cache block is identified, the data in the cache is either
“invalidated” or removed; otherwise operations progress as normal. As one can imagine, every
bus transaction must check the cache address tags and can potentially interfere with processor
cache accesses. This problem is magnified as the number of cores increases,
since all memory operations have to be broadcast to all other cores. Therefore, as the number
of cores increases, the local traffic scales with it and has a significant impact on the basic
metrics of any processor.

As with digital logic, the cost of keeping track of memory accesses can be limited by separating
the data. That is, the memory is distributed so that local memory traffic is kept apart from traffic
from other areas, making a hierarchical system of memory tracking similar to how
carry-lookahead adders distribute carries across a network [17]. To accomplish this task it is
sometimes easier not to use a Finite State Machine snooping protocol and instead to track the
distributed memory within a table using a directory protocol [19]. This directory keeps the state
of every block that may be cached, including which caches or collections of caches have copies of the
block, whether it is dirty, and so on. The key is to distribute the directory so that the coherence
protocol knows where to find the directory information for any cached block of memory.
Distributing the directory also enables different coherence requests to go to
different directories.
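
A minimal Python sketch of the directory idea follows. It is not the Verilog implementation described in this report; it assumes a simple three-state scheme (uncached, shared, modified), and the class names, the `home_directory` interleaving rule, and the core and block numbers are all hypothetical. Each entry records the block state and the set of caches holding a copy, so only those caches need to be contacted instead of broadcasting to every core.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    state: str = "uncached"  # "uncached", "shared", or "modified" (dirty)
    sharers: set = field(default_factory=set)  # cores holding a copy


class Directory:
    """Tracks, per memory block, its state and which cores cache it."""

    def __init__(self):
        self.entries = {}

    def _entry(self, block):
        return self.entries.setdefault(block, DirectoryEntry())

    def read(self, core, block):
        """Record a read; return the cores that must supply or write back data."""
        entry = self._entry(block)
        if entry.state == "modified" and core in entry.sharers:
            return set()  # the reader already owns the only valid copy
        owners = set(entry.sharers) if entry.state == "modified" else set()
        entry.sharers.add(core)
        entry.state = "shared"
        return owners

    def write(self, core, block):
        """Record a write; return the cores whose copies must be invalidated."""
        entry = self._entry(block)
        to_invalidate = entry.sharers - {core}
        entry.sharers = {core}
        entry.state = "modified"
        return to_invalidate


def home_directory(block, num_directories):
    """Assumed interleaving rule: pick the directory slice that owns a block."""
    return block % num_directories


directory = Directory()
directory.read(0, 0x40)
directory.read(1, 0x40)
print(directory.write(2, 0x40))  # only cores 0 and 1 are contacted
print(home_directory(0x40, 4))
```

Because invalidations go only to the recorded sharers, traffic grows with the number of actual copies rather than with the total number of cores.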

As with any commercial system, the problem for most designers is that there is nothing available
to help researchers develop these methods. One of the goals of this work is to integrate the ideas
into the current secure multiple-core processor design that the Air Force Research Laboratory is
currently utilizing. Therefore, to help implement the design, a Verilog implementation was
researched and a directory protocol was integrated to help with coherence. The problem is
compounded by the fact that any process, or heavyweight thread, could potentially be composed
of more threads within each process. Each of these is typically called a lightweight process and
makes the problem of multithreading more difficult, because a single program can be made up of a
number of different concurrent activities. This makes consistency within current computer
architectures difficult to describe and coherence even more difficult to handle. The problem
really comes down to what properties must be enforced among reads and writes to different
locations by different processors.

To help this process, researchers have considered using the directory structure while relaxing the
ordering requirements, still making sure that all reads and writes complete and are synchronized
by their ordering. That is, individual read and write operations are atomic. This is typically
called a relaxed consistency model, and the directory structure considered for this work comprises
a relaxed method for ordering writes after reads, reads after writes, writes after writes, and
finally reads after reads.

As stated previously, it is difficult to implement systems that have complex multiprocessor
behavior and also contain multiple cores. Therefore, a pipelined Microprocessor without
Interlocked Pipeline Stages (MIPS) model is utilized that is well studied and has already been
designed by the author in Verilog [18]. The Verilog model contains 4 cores and is fitted with the
Virage Memory architecture that the current AFRL secure processor utilizes. Each core has its
own cache, but the L2 cache has a local directory structure to keep track of all reads and writes
within the processor. Each directory structure that implements the coherence protocol is
simplified and is based on the Sun Niagara T2 architecture, which is
available on the Internet [20]. The basic architecture is shown in Figure 5. The directory is
integrated within the L2 cache, similar to the Sun Niagara T2, and each directory has an arbiter
that allows multiple elements to communicate with the other cores in a distributed fashion.


Figure 5, Basic MIPS MultiCore Directory Architecture

The Verilog model is extensively verified using several sample programs and testbenches. All
testbenches verified that the coherence and consistency protocols work well within the context of
the Verilog implementation. The Verilog is also written using Register Transfer Level (RTL)
constructs that can be further synthesized and placed & routed using the SoC design
framework discussed in Section IIB. An important part of the Verilog model is that it
incorporates a simplified MIPS model. This allows a user to study the design well, since
MIPS is well understood and is a basis for most modern computer architectures. Consequently,
once a user understands a given protocol for the directory structure, it is anticipated that it can
be used in future secure architectures within the AFRL that might use shared memory accesses.


4.0 RESULTS


The design flow is created in such a way that environment variables are utilized to enable it to
be installed anywhere. For example, all the libraries refer to an environment variable within
the common setup Tcl file discussed earlier. The IBM cmos10lpe 65nm SoC framework utilizes
standard cells and memories from Synopsys® (formerly Virage Logic). On the other hand, the
IBM cmos32soi 52nm SoC framework uses standard cells from ARM and memories from IBM.
Both implementations utilize MTCMOS-based standard-cell technologies. Although the
standard-cell implementations have been extensively tested, the memories are still not completely
tested for IBM’s cmos32soi 52nm technology. This is because they were not available from IBM
within the context of this work. However, they can easily be inserted by modifying the
common setup Tcl file once the memory becomes available.
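
The relocation idea can be illustrated with a short Python sketch that expands an environment variable inside a library path. The actual flow does this inside its common setup Tcl file; the variable name `SOC_FLOW_ROOT`, the path layout, and the function `resolve_path` are assumptions made purely for this example.

```python
import os
from string import Template


def resolve_path(template, env=None):
    """Expand $VARIABLE references in a path using the given environment.

    Raises RuntimeError when a referenced variable is not set, so a
    misconfigured installation fails early instead of using a bad path.
    """
    env = os.environ if env is None else env
    try:
        return Template(template).substitute(env)
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing.args[0]} is not set") from None


# Hypothetical variable and directory layout, shown with an explicit environment.
example_env = {"SOC_FLOW_ROOT": "/opt/soc_flow"}
print(resolve_path("$SOC_FLOW_ROOT/libs/std_cells.lib", example_env))
```

Moving the installation then only requires changing the one variable rather than editing every library path.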

Scripts are designed to work with the design flow, so that any design can easily be taken from an
HDL design to a mask layout. All standard-cell designs and memory elements are created
using scripts that either generate a placed and routed design or invoke a language within the
Cadence Design Systems® or Synopsys® EDA tools to create a memory array.

Power dissipation is an important element within any computer architecture; however, the
computation of power can be complicated by the level of extraction that occurs for a given
design. Consequently, the SoC framework and design flow presented here has several different
power estimation tools that can either estimate the power during synthesis with no parasitic
extraction of the wires (e.g., through Synopsys® PowerCompiler™ or PrimeTime™) or
compute it more accurately using an extracted SPEF file (e.g., through Synopsys® NanoSim™).
Scripts are created that allow all designs to be tested easily for timing, area, and power
parameters. More importantly, all the scripts can be modified so that power, delay, or area can
be targeted as an optimization constraint for a given computer architecture. All designs were
created through the use of scripts designed to interface with the HDL
definitions.

The designs presented in this research are designed to allow a user to focus on a particular item
for optimization, such as Signal Integrity, and experiment with design parameters to enhance
understanding and potentially move between different EDA tools. Right now, only Synopsys® ICC
and Cadence® EDI are utilized, but this could be expanded to other tools, if needed. The
designs are all self-contained within their directories and can be remade through their Makefiles by
typing “make all”. Since each design is composed of smaller implementation design
architectures, each design runs with minimum run time and system resources. After each step is
run, a DEF file is saved at the end to allow conversion between the tools.

All of the designs are created with the most recent EDA software provided by
EDA tool vendors. That is, the process of going from design architecture to deployable silicon,
or back-end process, is implemented with Cadence Design Systems and Synopsys tools separately and
self-contained within one set of scripts. Cadence Design Systems® (CDS) utilizes Encounter Digital
Implementation™ and Synopsys® employs IC Compiler™. That is, a full set of scripts for
both vendors has been created for this work in the hope that they can be explored and utilized
separately or in combination. Since the front-end process, or the synthesis portion of the design
flow, is common to both, both systems employ Synopsys® Design Compiler™ (DC) for turning
the SoC architecture into an implementation based on a technology-mapped netlist. Moreover,
the simulation tool ModelSim from Mentor Graphics is also
employed in each design for HDL simulation and verification, since it is a rich and full-featured
simulator.

All designs are contained on the Oklahoma State University/AFRL research server. It is also
anticipated that many Air Force Research Laboratory personnel can utilize the research presented
in this report for further study and other projects.


5.0 CONCLUSIONS


This paper demonstrates the research that was conducted for the Air Force Research Laboratory.
The combination of architecture and circuit-based enhancements with a common set of
software design flows yields a cohesive SoC framework that produces designs that are several
orders of magnitude better than implementing them manually or without the framework. More
importantly, tools have been created that allow users to create multiple designs seamlessly as
well as create an agile environment for the deployment of VLSI computer architectures in
silicon.

The design flow is designed to work out-of-the-box and allow anyone to create fast and efficient
designs with the IBM cmos10sf 65nm or IBM cmos32soi 52nm technologies. There is still
significant additional work that can be done by incorporating better variation control within a
design, power gating, clock gating, and the Synopsys® Unified Power Format language.
This language is useful in allowing many designs to optimize energy and power consumption
within mask layouts. Future work will be ongoing in this area to produce better and more
productive computer architectures with multiple cores and better security enhancements.


LIST OF SYMBOLS, ABBREVIATIONS AND ACRONYMS


AFRL      Air Force Research Laboratory
ASCII     American Standard Code for Information Interchange
CDS       Cadence Design Systems
CMOS      Complementary Metal Oxide Semiconductor
DEF       Design Exchange Format
EDA       Electronic Design Automation
EDI       Encounter Digital Implementation
FPGA      Field Programmable Gate Array
FSM       Finite State Machine
GUI       Graphical User Interface
HDL       Hardware Description Language
ICC       Integrated Circuit Compiler
IPC       Instructions Per Cycle
ISUB      sub-threshold current
ITRS      International Technology Roadmap for Semiconductors
LEF       Library Exchange Format
MIPS      Microprocessor without Interlocked Pipeline Stages
MTCMOS    Multi-Threshold CMOS
NMOS      Negative Metal Oxide Semiconductor
NP        Non-deterministic polynomial
NSF       National Science Foundation
OA        OpenAccess
PMOS      Positive Metal Oxide Semiconductor
RAM       Random Access Memory
ROM       Read Only Memory
RTL       Register Transfer Level
SoC       System on a Chip
Tcl       Tool Command Language
TLP       Thread-Level Parallelism
VGS       gate-source voltage
VLSI      Very Large Scale Integration
Vt        threshold voltage